Skip to content

[DEV-14238] - Transaction Loader Refactor#4577

Open
zachflanders-frb wants to merge 22 commits intoqatfrom
ftr/dev-14238-transaction-loader-refactor
Open

[DEV-14238] - Transaction Loader Refactor#4577
zachflanders-frb wants to merge 22 commits intoqatfrom
ftr/dev-14238-transaction-loader-refactor

Conversation

@zachflanders-frb
Copy link
Contributor

@zachflanders-frb zachflanders-frb commented Jan 14, 2026

Description:

This PR refactors the transaction loader flow to remove the lookup tables.

Technical Details:

This PR make the following changes to the transaction loader flow:

  • add a hash key to the published_fabs and detached_award_procurement delta tables
  • use the hash key and unique id column (afa_generated_unique/detached_award_proc_unique) for the merge with transaction fabs/fpds/normalized tables.
    • Using the unique keys allows us to eliminate the use of the transaction lookup tables
    • Adding the hash allows us to compare the entire table and eliminate the use of the last_etl_date. This ensures the entire tables are kept in sync.
  • Use WHEN NOT MATCHED BY SOURCE in the merge statement itself instead of having a separate delete function for transactions
  • Use the unique_award_key column to form the relationships between awards and transactions instead of using the award lookup table (eliminates need for the award lookup table)
  • splits the load_transactions_in_delta into separate commands
  • Reorganizes code using object-oriented patterns for better maintainability

Requirements for PR Merge:

  1. Unit & integration tests updated
  2. API documentation updated (examples listed below)
    1. API Contracts
    2. API UI
    3. Comments
  3. Data validation completed (examples listed below)
    1. Does this work well with the current frontend? Or is the frontend aware of a needed change?
    2. Is performance impacted in the changes (e.g., API, pipeline, downloads, etc.)?
    3. Is the expected data returned with the expected format?
  4. Appropriate Operations ticket(s) created
  5. Jira Ticket(s)
    1. DEV-0

Explain N/A in above checklist:

@zachflanders-frb zachflanders-frb marked this pull request as ready for review January 15, 2026 17:28
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this step largely unchanged except I used the unique_award_key/generated_unique_award_id and unique_transaction_id to form the relationships between the awards and normalized transactions.

zachflanders-frb and others added 2 commits January 15, 2026 13:32
Co-authored-by: Andrew Guest <110476931+aguest-kc@users.noreply.github.com>
Comment on lines +57 to +58
table_exists = self.spark._jsparkSession.catalog().tableExists(f"int.awards")
if not table_exists:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could avoid creating a single-use variable here.

Suggested change
table_exists = self.spark._jsparkSession.catalog().tableExists(f"int.awards")
if not table_exists:
if not self.spark._jsparkSession.catalog().tableExists(f"int.awards"):

@zachflanders-frb zachflanders-frb changed the title Transaction Loader Refactor [DEV-14238] - Transaction Loader Refactor Jan 16, 2026
Comment on lines +322 to +325
super().load_transactions()
self.populate_award_ids()
self.populate_transaction_normalized_ids()
self.link_transactions_to_normalized()
Copy link
Contributor Author

@zachflanders-frb zachflanders-frb Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This involves 4 sequential executions of merge statements. I need to explore whether I can adjust this to two executions and whether that will be more performant.

Copy link
Contributor

@sethstoudenmier sethstoudenmier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An initial pass on the changes. Overall, I didn't see anything that I would block over. Will take another pass before approving once testing is done.

@@ -62,7 +62,7 @@ jobs:
with:
cov-report-name: 'spark-load-transactions-fabs-fpds-tests'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cov-report-name: 'spark-load-transactions-fabs-fpds-tests'
cov-report-name: 'spark-load-transactions-tests'

I believe we actually can remove the generation of these reports, however, to be consistent we should change the name for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the filename accordingly as well?

)

def handle(self, *args, **options):
with self.prepare_spark():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the defined function in usaspending_api/etl/transaction_delta_loaders/context_managers.py instead of the separately defined method below?

update_last_load_date("awards", next_last_load)

@contextmanager
def prepare_spark(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my comment above; could this be removed and instead use the function in usaspending_api/etl/transaction_delta_loaders/context_managers.py?

subquery = """
SELECT awards.generated_unique_award_id AS id_to_remove
FROM int.awards
LEFT JOIN int.transaction_normalized on awards.transaction_unique_id = transaction_normalized.transaction_unique_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably isn't too bad performance wise, but it is possible that an EXISTS instead of a JOIN could help with performance some since we don't care about returning any of the values from int.transaction_normalized in this query.

I won't add this comment anywhere else, but the same could hold true. Being inside of a subquery may negate benefit of using EXISTS over a JOIN. If performance is looking good, then probably not worth exploring.


def delete_records_sql(self):
id_col = "generated_unique_award_id"
# TODO could do an outer join here to find awards that do not join to transaction fpds or transaction fabs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment saying that you could move away from Transaction Normalized to use FPDS and FABS directly? If so, do you mind clarifying in the comment.

Comment on lines +144 to +148
/* NOTE: In Postgres, the default sorting order sorts NULLs as larger than all other values.
However, in Spark, the default sorting order sorts NULLs as smaller than all other
values. In the Postgres transaction loader the default sorting behavior was used, so to
be consistent with the behavior of the previous loader, we need to reverse the default
Spark NULL sorting behavior for any field that can be NULL. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* NOTE: In Postgres, the default sorting order sorts NULLs as larger than all other values.
However, in Spark, the default sorting order sorts NULLs as smaller than all other
values. In the Postgres transaction loader the default sorting behavior was used, so to
be consistent with the behavior of the previous loader, we need to reverse the default
Spark NULL sorting behavior for any field that can be NULL. */
-- NOTE: In Spark, the default sorting order sorts NULLs as smaller than all other values.

Now that we are so far removed from the Postgres pipeline, I feel that this comment can be updated to include simply how Spark handles NULL values.

SELECT
latest.id,
latest.unique_award_key,
0 AS subaward_count, -- for consistency with Postgres table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
0 AS subaward_count, -- for consistency with Postgres table
0 AS subaward_count, -- default value that is updated later

Similar to other comment suggestion, since we are here and far removed from the Postgres pipeline it would be good to make this more meaningful.

@zachflanders-frb zachflanders-frb added the do not merge [PR] shouldn't be merged label Jan 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do not merge [PR] shouldn't be merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants